\section{Evaluation}
\label{sec:eval}

%In this section, we evaluated the performance of our library with benchmarks.

\subsection{Benchmarks and Experimental Settings}

We evaluated our template approach with typical procedures
in multimedia and scientific fields. Additionally,
we implemented a micro-benchmark to evaluate the performance of
pipeline processing. These benchmark programs are listed
as follows:

\begin{itemize}
\item \textit{saxpy}. A procedure in BLAS level 1, which represents a
scalar multiplies to a vector of 32 million single precision floats.

\item \textit{sgemm}. A procedure in BLAS level 3, which represents
two 4096*4096 dense matrices multiply.

\item \textit{dotprod}. Two vectors perform dot product. Each vector
  comprises of 32 million elements.

\item \textit{conv2d}. 2-Dimensional convolution operation on image.  The Image
  is in 4094*4096 black-white format. Pixel is normalized as a single
  float ranging from 0.0 to 1.0.

\item \textit{pipe}.  This micro-benchmark consists of 4 stages
and each stage consumes $t \  \mu s$ CPU time. The execution iterates the
pipeline 500 times.  Similar to \reffig{viewmt}, a stage can execute only when the preceding stage
finishes.  

%Pseudo-multi-language translation. A word is translated from one language A to language B, and then another function will translate it from language B to language C, etc.

\end{itemize}

% Consider the trend of C++,
%development of template library like libvina should become easier
%and smoother in the future.  

Two multicore platforms used in our experiments are described in Table~\ref{tbl:mach}.
The harpertown blade server contains two quad-core Xeon processors. 
On macbookpro, the embedded Nvidia GPU on motherboard contains two streaming multiprocessors,
each consists of 8 scalar processors(SP) and supports up to 128 threads.

\begin{table}[hbt]
\caption{Experimental platforms}\label{tbl:mach}
\begin{center}
\begin{tabular}{|l|l|l|l|l|r|}
\hline
\textbf{name}&\textbf{type}&\textbf{processors}&\textbf{memory}&\textbf{OS}\\
\hline
harpertown&SMP &x86 quad-core  &4G&Linux Fedora\\
                  &  server &   
2-way  2.0Ghz & &kernel 2.6.30\\
\hline
macbookpro&laptop &x86 dual-core &2G&Mac OSX\\
                    &            & 2.63Ghz         &  &Snowleopard\\
\cline{3-4}
                   &             &GPU: Nvidia      &256M & \\
                    &            &9400m 1.1Ghz     & &\\
\hline
\end{tabular} 
\end{center}
\end{table}

We developed and tested libvina using GCC 4.5 beta.  For GPU, we developed
our library with OpenCL~\cite{opencl} shipped with Mac OSX 10.6. 
On harpertown, we link Intel's Math Kernel Library (MKL) to perform BLAS procedures
if they are available\footnote{Sgemm, saxpy, and dotprod are available
in MKL and we use the sequential version in the experiments.}.
On macbookpro, we implement all the procedures on our own. 


\subsection{Libvina Performance}

%We evaluate our library using preceded programs.
%Experiment 1 applies our TF class to hierarchically divide
%tasks for CPU. Experiment 2 evaluates performance of 
%transformations for GPU. Experiment 3 evaluates the
%performance of a pipeline programs using TF\_pipeline.

\subsubsection{Speedup of hierarchical transformation on CPU}
\label{exp:1}

\begin{figure}
  \includegraphics{../speedupx86.0}
  \caption{Speedup of hierarchical transformation on CPU}
  \label{fig:spdx86}
\end{figure}

\reffig{spdx86} shows the speedup of programs that apply our
\code{TF\_hierarchy} class on harpertown. 
%All predicates are set to cater for CPU's last level cache (LLC).
Programs \textit{conv2d} and \textit{sgemm} show good performance scalability.
\textit{conv2d} obtains about 7.3 times speedup with 8 cores, because it does not
have any data dependences.
\textit{sgemm} achieves 6.3 times speedup using 8 cores, because it
needs an extra reduction for each subtask.  It is
worthy noting that there is almost two-fold speedup with two cores. 
When using four cores, Linux cannot guarantee that four tasks are distributed
on the same physical processor. As a result, the cost of memory accesses and synchronization
increase from two cores to four cores.

%the speedup degrates to 3.3 times when the number of core continuously
%doubles. Harpertown consists of two quad-core processors,  but Linux
%cannot guarantee that 4 tasks are distributed evenly in a physical
%processor. Therefore, the cost of memory accesses and synchronization
%increase from 2-core to 4-core.

\textit{dotprod} and \textit{saxpy} show low speedups because non-computation-intensive
programs are subject to memory bandwidth.  In average, \textit{saxpy} needs one load and one 
store for every two operations. \textit{dotprod} has similar execution patterns.
They quickly saturate memory bandwidth for our SMP system. 


\subsubsection{Speedup on GPU}\label{exp:2}

\begin{figure}
\includegraphics{../speedupgpu.0}
\caption{Speedup of transformations on GPU.}
\label{fig:spdgpu}
\end{figure}

\reffig{spdgpu} shows SPMD transformation results for GPU on
macbookpro. The running time of sequential programs running on host CPU is set to
be the baseline.
%Porting from CPU to GPU, developers only need to change
%template classes while keeping algorithms same \footnote{
%Because GPU code needs special qualifiers, we did modify kernel
%functions a little manually.  Most algorithms are kept same except for
%sgemm.  Because it is not easy
% to work out sgemm on platform macbookpro, we add blocking and SIMD
% instruments for CPU.}.
As figure depicted,  computation-intensive programs
\textit{sgemm} and \textit{conv2d} achieve 4.5 to 5 times speedup
by migrating to GPU.
\textit{saxpy} has about two times performance boost, while \textit{dotprod}
has no performance improvement. This is because
Nvidia GPUs execute threads in units of warp (each warp consists of 32 threads) on hardware and it is
possible to coalesce memory accesses if warps satisfy
specific access patterns~\cite{nvopencl}. For \textit{saxpy}, memory coalescence
mitigates bandwidth issue occurred on CPU counterpart. However, this is not
the case for \textit{dotprod}.
%Because our program of \textit{dotprod} has fixed
%step to access memory which does not fit any patterns, we cannot
%obtain hardware optimization without tweaking the algorithm.

\subsubsection{Performance of Pipeline Patterns on CPU}
\label{exp:4}

\begin{figure}[htp]
\includegraphics{../pipeline.0}
\caption{Performance of a four-stage pipeline on 8-core CPU.}
\label{fig:pipe}
\end{figure}

This experiment evaluates the performance of pipeline patterns of our template
library. Specifically, we choose micro-benchmark \textit{pipe} with varying
granularities for each stage, from 5 $\mu s$ to 100 $\mu s$. \reffig{pipe} shows
that when the time for each stage is larger than 20 $\mu s$, a pipeline can
roughly produce one output every $t$ $\mu s$. For instance, when $t=100$ $\mu s$,
the output is generated every 54 $\mu s$ for eight cores (two four-stage pipelines).

%\textit{I.e.},  when 4 cores are exposed to the system, our program can roughly output one instance every $t\  \mu
%s$. The speedup is easy to maintain when granularity is big. 100 $\mu s$ case ends up 54 $\mu s$ for each instance for 8 cores. 50  $\mu s$ case
%bumps at 5 cores and then improves slowly along core increment. 20
%$\mu s$ case also holds the trend of first two cases.
For fine-grained pipeline of 5 $\mu s$ per stage, we cannot observe ideal pipelining until all 8
cores are available.  Our Linux kernel scheduler's granularity is 80
$\mu s$ by default. In this case, fine-grained tasks contend for
CPU resources, incurring extra overhead. Many cores scenario help alleviate the
problem.

\comment{
\subsection{Comparison between different multicores}
\label{exp:3}

\begin{table}[hbt]
\caption{Comparison of sgemm on CPU and GPU}\label{tbl:sgemm}
\begin{tabular}{|l|r|r|r|}
\hline
& baseline& CPU & GPU\\
\hline
\textbf{Cores} &1 x86(penryn)& 8 x86(harpertown)& 2 SMs\\
\hline
\textbf{Gflops}& 2.64 &95.6&  12.0\\
\hline
\textbf{Effectiveness}&12.6\%& 74.9\%&68.2\%\\
\hline
\textbf{Lines of function}&63&unknown&21\\
\hline
\end{tabular}
\end{table}

Table~\ref{tbl:sgemm} details \textit{sgemm} execution on CPU and GPU. Dense matrix
multiplication is one of  typical programs which have
intensive computations.
%Problems with this characteristic are the most
%attractive candidates to apply our template-based approach.
%Our template library transforms the \textit{sgemm} for both CPU and 
%GPU using building blocks.
Sequential execution on macbookpro's CPU is used as baseline to compare with.
After mapping the algorithm to GPU, we obtain over 4.5 times speedup.
Theoretically, Intel Core2 processor can issue two SSE instructions per cycle. 
Therefore, the peak float performance is 21 Gflops on host CPU. We obtain 2.64 Gflops,
which utilizes about only 12.6\% of full capacity.
%even we employ quite complicated implementation.
On the other side, over 12 Gflops is observed on GPU whose
maximal performance is roughly 17.6 Gflops ($17.6Gflops = 1.1Ghz * 2(SM) *
8(SP))$~\footnote{Nvidia declared their GPUs can perform a mad (multiply-add
op) per cycle  for users who are concerned about performance over precision. However, we can
not observe mad hints bring any performance improvement in OpenCL.}.

Although both column 2 and column 4 implement SIMD algorithm for
\textit{sgemm}, GPU's version is obviously easier and effective. It is
due to the dynamic SIMD and thread management from GPU
hardware~\cite{Fatahalian08} can significantly ease vector programming. Programmer can
implement algorithm in plain C and then relies on template
transformation for GPU.  Adapting to GPU only need tens of lines code
efforts. Like GPU template, we change homebrewed sgemm to MKL
procedure and apply building blocks to parallelize it for CPU. We obtain 95.6 
Gflops and about 75\% effectiveness on harpertown server.
}
